Patrick Cao
Insider trading, or trading financial assets using information that is not open to the public, is punishable by law in the United States. It is a serious violation -- a maximum penalty of 20 years in prison, and up to $20 million in fines. The Securities and Exchange Commission (SEC) is responsible for identifying and conducting investigations regarding insider trading.
With that being said, U.S. senators (and other congresspeople) often have access to information about upcoming policies or laws before the general public, yet are still allowed to trade securities. Senators, however, must disclose their trades publicly. More recently, a few senators have been investigated by the SEC for insider trading using nonpublic knowledge given to them about the COVID-19 pandemic. With the rise in popularity of investing, senator stock trading patterns has become to the public's attention. You may have seen posts on the internet implying that there is rampant insider trading going on in the senate. For example, take this post on Reddit that was upvoted over 30,000 times: https://www.reddit.com/r/dataisbeautiful/comments/gjlvnd/.
Is insider trading in the senate as rampant as this post suggests? This tutorial aims to identify potential insider trading patterns by the U.S. senate. In order to answer this question, we will be using data from https://senatestockwatcher.com/, which itself compiles data from https://efdsearch.senate.gov/. efdsearch.senate.gov does not have an official API, and scraping each page would be extremely tedious. Senate stock watcher aggregates financial disclosures from all senators into one nice dataset.
Let's first start by downloading the dataset. While Senate Stock Watcher provides an API for up to date datasets, we can just download a csv file as well. Download the "All Transactions" csv file from https://senatestockwatcher.com/api.html.
Let's load that into a pandas dataframe.
import pandas as pd
import numpy as np
import datetime
df = pd.read_csv('all_transactions.csv')
df.head()
Let's clean this up a little by limiting and purchases and sales only, and dropping unnecessary columns.
start_date = '2016-01-01'
end_date = '2020-11-01'
# Turn transaction_date from string into datetime
df['transaction_date'] = pd.to_datetime(df['transaction_date'])
# Remove (Full) and (Partial) after Sale type
df['type'] = df['type'].apply(lambda x: str(x).split()[0])
# Limit to only purchases and sales
df = df[(df['type'] == 'Purchase') | (df['type'] == 'Sale')]
# limit the dataset to a date range
df = df[(df['transaction_date'] > start_date) & (df['transaction_date'] < end_date)]
df = df.drop(['asset_description', 'comment', 'ptr_link'], axis=1)
df.head()
We got our senator stock trading data, but now we need actual historical market data to visualize these trades. Let's use yfinance, a python Yahoo! Finance library, to get historical market data on SPY, an ETF tracking the S&P500.
import yfinance as yf
spy = yf.Ticker('SPY')
spy_hist = spy.history(period='max')
spy_hist.tail()
Neat! Now we can start putting these two things together.
import matplotlib.pyplot as plt
# number of trades by senators
num_trades = df.groupby('senator').count().sort_values(by='transaction_date', ascending=False).head(10)
plt.title('Top 10 Most Active Senators')
plt.xlabel('Total Number of Trades')
plt.ylabel('Senator')
plt.barh(num_trades.index, num_trades['transaction_date'])
plt.show()
# number of trades by ticker
trades_by_ticker = df[df['ticker'] != '--'].groupby('ticker').count().sort_values(by='transaction_date', ascending=False).head(20)
plt.title('Top 20 Most Traded Stocks')
plt.xlabel('Total Number of Trades')
plt.ylabel('Ticker')
plt.barh(trades_by_ticker.index, trades_by_ticker['transaction_date'])
plt.show()
# values of trades
trade_val = df.groupby('amount').count().sort_values(by='transaction_date', ascending=False)
plt.title('Value of Trades')
plt.xlabel('Number of Trades')
plt.ylabel('Value of Trade ($)')
plt.barh(trade_val.index, trade_val['transaction_date'])
plt.show()
It looks like most trades by senators are small -- under $15,000 per trade -- and they like trading AAPL, which is coincidentally the largest market cap company in the world. There's a good mix of tech, pharma, communications, and retail stocks that are the most commonly traded in the senate. There's not too much insight we can pull from this, but it is interesting.
Let's now try to recreate the reddit post. We need to plot overall senator trading activity against the S&P500, or SPY.
It's a little tough to track performance exactly. The size of each trade isn't an exact number -- it's a range. Let's check out what kinds of values we're working with here.
df['amount'].unique()
Yikes. The data doesn't have the exact value of securities traded, the best thing we can probably do is estimate the size of each trade by just taking the middle value of each trade. Unfortunately, the granularity of each bucket is pretty big, but we'll have to try and make do.
# Returns the average of the given bound
def getBounds(row):
amount = row[0]
trade_type = row[1]
# This amount shows up once in the entire dataset. Let's just assume the trade is $50m for now. It won't affect our analysis too much.
if amount == 'Over $50,000,000':
return 5e7
split = amount.split(' - ')
lower = int(split[0][1:].replace(',', ''))
upper = int(split[1][1:].replace(',', ''))
mid = lower + ((upper - lower) / 2)
if trade_type == 'Purchase':
return mid
return -mid
df['mid_amt'] = pd.Series(df[['amount', 'type']].apply(getBounds, axis=1))
df
Alright, now we got some rough estimates of the value of each transaction, we can group trades together by date and find the aggregate for each day.
aggregate_action = df.groupby([pd.Grouper(key='transaction_date', freq='W'), 'asset_type', 'type']).sum()
# Groupby makes transaction_date and type an index, we want to convert those back into a column
aggregate_action.reset_index(inplace=True)
aggregate_action.head()
We are now ready to plot senator trades against the market.
spy_hist['Date'] = spy_hist.index
# since we've limited the senator trades to a date range, we want to also limit SPY history
# to the same date range
market = spy_hist[(spy_hist['Date'] > start_date) & (spy_hist['Date'] < end_date)]
def plot_trades_vs_spy(trades, asset_type):
# We have a lot of data, so we should make the figure very wide
plt.figure(figsize=(30,10))
plt.plot(market['Date'], market['Close'])
plt.xlabel('Year')
plt.ylabel('SPY price ($)')
ax2 = plt.twinx()
# scale bar graph limits s.t. 0 is centered in graph
limit = max(abs(trades['mid_amt'].min()), trades['mid_amt'].max()) * 1.1
ax2.set_ylim(-limit, limit)
ax2.spines['bottom'].set_position(('data', 0))
plt.ylabel('Amount of Securities Traded ($)')
plt.bar(trades['transaction_date'], trades['mid_amt'], width=5, label='Net transaction', color='orange')
plt.title('Senator Trades of ' + str(asset_type) + ' vs. SPY price')
plt.legend()
plt.show()
for asset_type in aggregate_action['asset_type'].unique():
total = aggregate_action[aggregate_action['asset_type'] == asset_type]
plot_trades_vs_spy(total, asset_type)
plot_trades_vs_spy(aggregate_action, 'Overall')
Let's also plot the net amount traded vs. future SPY price as well. We first need
import pandas_market_calendars as mcal
from datetime import datetime
nyse = mcal.get_calendar('NYSE')
market_open_days = nyse.valid_days(start_date=start_date, end_date=datetime.today().strftime('%Y-%m-%d'))
# returns a function that gets the spy price num_days from today
def get_future_spy_price(num_days):
def h(today):
# calculate current spy data
today_index = market_open_days.get_loc(today, method='nearest')
today_close = spy_hist.iloc[spy_hist.index.get_loc(today, method='nearest')]['Close']
# calculate future date and get spy data
future_index = today_index + num_days
future_date = market_open_days[future_index]
future_close = spy_hist.iloc[spy_hist.index.get_loc(future_date, method='nearest')]['Close']
return (future_close - today_close) / today_close
return h
aggregate_action['5_day_spy_pct_change'] = aggregate_action['transaction_date'].apply(get_future_spy_price(5))
aggregate_action['15_day_spy_pct_change'] = aggregate_action['transaction_date'].apply(get_future_spy_price(15))
# aggregate_total['30_day_spy_pct_change'] = aggregate_total['transaction_date'].apply(get_future_spy_price(30))
test = aggregate_action[abs(aggregate_action['mid_amt']) > 0]
for asset_type in test['asset_type'].unique():
print(asset_type)
testx = test[test['asset_type'] == asset_type]
print(test.head())
plt.figure(figsize=(30,10))
plt.scatter(testx['mid_amt'], testx['5_day_spy_pct_change'])
plt.xlabel('Estimated Amount of Securities Purchased/Sold')
plt.show()
plt.figure(figsize=(30,10))
plt.scatter(testx['mid_amt'], testx['15_day_spy_pct_change'])
plt.show()
# plt.figure(figsize=(30,10))
# plt.scatter(test['mid_amt'], test['30_day_spy_pct_change'])
# plt.show()
aggregate_action_by_senator = df.groupby([pd.Grouper(key='transaction_date', freq='D'), 'asset_type', 'type', 'senator']).sum()
# Groupby makes transaction_date and type an index, we want to convert those back into a column
aggregate_action_by_senator.reset_index(inplace=True)
# limit the dataset to a date range
aggregate_action_by_senator = aggregate_action_by_senator[(aggregate_action_by_senator['transaction_date'] > start_date) & (aggregate_action_by_senator['transaction_date'] < end_date)]
aggregate_action_by_senator['mid_amt'] = aggregate_action_by_senator.apply(lambda row: row['mid_amt'] if row['type'] == 'Purchase' else -row['mid_amt'], axis=1)
aggregate_action_by_senator['5_day_spy_pct_change'] = aggregate_action_by_senator['transaction_date'].apply(get_future_spy_price(5))
# aggregate_total['15_day_spy_pct_change'] = aggregate_total['transaction_date'].apply(get_future_spy_price(15))
# plt.figure(figsize=(30,10))
# plt.scatter(senator_actions['mid_amt'], testx['5_day_spy_pct_change'])
# plt.xlabel('Estimated Amount of Securities Purchased/Sold')
# plt.show()
# plt.figure(figsize=(30,10))
# plt.scatter(test['mid_amt'], test['15_day_spy_pct_change'])
# plt.show()
print(aggregate_action_by_senator)
for senator in aggregate_action_by_senator['senator'].unique():
senator_actions = aggregate_action_by_senator[aggregate_action_by_senator['senator'] == senator]
plt.axvline(0, color='black')
plt.axhline(0, color='black')
plt.grid()
plt.scatter(senator_actions['mid_amt'], senator_actions['5_day_spy_pct_change'])
plt.show()
hists = {}
# returns a function that gets the spy price num_days from today
def get_future_stock_price(ticker, today, num_days):
# print('Finding %change in price after', num_days, 'days for', ticker, 'on', today)
try:
stock_history = yf.Ticker(ticker).history(period='5y')
# calculate current stock data
today_index = market_open_days.get_loc(today, method='nearest')
today_close = stock_history.iloc[stock_history.index.get_loc(today, method='nearest')]['Close']
# calculate future date and get stock data
future_index = today_index + num_days
future_date = market_open_days[future_index]
future_close = stock_history.iloc[stock_history.index.get_loc(future_date, method='nearest')]['Close']
return (future_close - today_close) / today_close
except Exception as e:
return np.nan
stock_trades = df[df['ticker'] != '--']
# we're processing an insane amount of data here, so let's save this dataframe back to the disk
# so we can read it back easily
try:
stock_trades = pd.read_csv('stock_trading_performance.csv')
except FileNotFoundError:
print('calculating...')
stock_trades['5_day_pct_change'] = stock_trades.apply(lambda row: get_future_stock_price(row['ticker'], row['transaction_date'], 5), axis=1)
print('done with 5 day')
stock_trades['15_day_pct_change'] = stock_trades.apply(lambda row: get_future_stock_price(row['ticker'], row['transaction_date'], 15), axis=1)
print('done with 15 day')
stock_trades['30_day_pct_change'] = stock_trades.apply(lambda row: get_future_stock_price(row['ticker'], row['transaction_date'], 30), axis=1)
print('done with 15 day')
# stock_trades.apply(lambda row: get_future_stock_price(row['ticker'], row['transaction_date'], 5), axis=1)
stock_trades.to_csv('stock_trading_performance.csv')
print(stock_trades['owner'].unique())
print(len(stock_trades['senator'].unique()))
stock_trades['5_day_pct_change'] = pd.to_numeric(stock_trades['5_day_pct_change'], errors='coerce')
stock_trades['15_day_pct_change'] = pd.to_numeric(stock_trades['15_day_pct_change'], errors='coerce')
for senator in stock_trades['senator'].unique():
for owner in stock_trades['owner'].unique():
print(senator, owner)
stock_trades_by_senator = stock_trades[(stock_trades['senator'] == senator) & (stock_trades['owner'] == owner)]
plt.scatter(stock_trades_by_senator['mid_amt'], stock_trades_by_senator['5_day_pct_change'])
plt.show()
# plt.scatter(stock_trades['mid_amt'], stock_trades['5_day_pct_change'])
# plt.show()
# plt.scatter(stock_trades['mid_amt'], stock_trades['15_day_pct_change'])
for senator in stock_trades['senator'].unique():
stock_trades_by_senator = stock_trades[stock_trades['senator'] == senator]
fig, axs = plt.subplots(1, 2, figsize=(12, 5))
fig.suptitle('Trade Value vs. Future Stock Price % Change for Senator ' + senator)
axs[0].scatter(stock_trades_by_senator['mid_amt'], stock_trades_by_senator['5_day_pct_change'])
axs[0].set_title('5 Day')
axs[0].set_xlabel('Net Value of Assets Bought/Sold')
axs[0].set_ylabel('5 Day % Change in Stock Price')
axs[1].scatter(stock_trades_by_senator['mid_amt'], stock_trades_by_senator['15_day_pct_change'])
axs[1].set_title('15 Day')
axs[1].set_xlabel('Net Value of Assets Bought/Sold')
axs[1].set_ylabel('15 Day % Change in Stock Price')
axs[0].grid()
axs[1].grid()
# plt.
# plt.title('Trade vs. 5 Day Future Stock Price % Change for Senator ' + senator)
plt.show()